{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Choosing blocking rules to optimise runtime\n",
        "\n",
        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/tutorials/03_Blocking.ipynb\">\n",
        "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
        "</a>\n",
        "\n",
        "To link records, we need to compare pairs of records and decide which pairs are matches.\n",
        "\n",
        "For example consider the following two records:\n",
        "\n",
        "| first_name | surname | dob        | city   | email               |\n",
        "| ---------- | ------- | ---------- | ------ | ------------------- |\n",
        "| Robert     | Allen   | 1971-05-24 | nan    | roberta25@smith.net |\n",
        "| Rob        | Allen   | 1971-06-24 | London | roberta25@smith.net |\n",
        "\n",
        "These can be represented as a pairwise comparison as follows:\n",
        "\n",
        "| first_name_l | first_name_r | surname_l | surname_r | dob_l      | dob_r      | city_l | city_r | email_l             | email_r             |\n",
        "| ------------ | ------------ | --------- | --------- | ---------- | ---------- | ------ | ------ | ------------------- | ------------------- |\n",
        "| Robert       | Rob          | Allen     | Allen     | 1971-05-24 | 1971-06-24 | nan    | London | roberta25@smith.net | roberta25@smith.net |\n",
        "\n",
        "For most large datasets, it is computationally intractable to compare every row with every other row, since the number of comparisons rises quadratically with the number of records.\n",
        "\n",
        "Instead we rely on blocking rules, which specify which pairwise comparisons to generate. For example, we could generate the subset of pairwise comparisons where either first name or surname matches.\n",
        "\n",
        "This is part of a two step process to link data:\n",
        "\n",
        "1.  Use blocking rules to generate candidate pairwise record comparisons\n",
        "\n",
        "2.  Use a probabilistic linkage model to score these candidate pairs, to determine which ones should be linked\n",
        "\n",
        "**Blocking rules are the most important determinant of the performance of your linkage job**.\n",
        "\n",
        "When deciding on your blocking rules, you're trading off accuracy for performance:\n",
        "\n",
        "- If your rules are too loose, your linkage job may fail.\n",
        "- If they're too tight, you may miss some valid links.\n",
        "\n",
        "This tutorial clarifies what blocking rules are, and how to choose good rules.\n",
        "\n",
        "## Blocking rules in Splink\n",
        "\n",
        "In Splink, blocking rules are specified as SQL expressions.\n",
        "\n",
        "For example, to generate the subset of record comparisons where the first name and surname matches, we can specify the following blocking rule:\n",
        "\n",
        "```python\n",
        "from splink import block_on\n",
        "block_on(\"first_name\", \"surname\")\n",
        "```\n",
        "\n",
        "When executed, this blocking rule will be converted to a SQL statement with the following form:\n",
        "\n",
        "```sql\n",
        "SELECT ...\n",
        "FROM input_tables as l\n",
        "INNER JOIN input_tables as r\n",
        "ON l.first_name = r.first_name AND l.surname = r.surname\n",
        "```\n",
        "\n",
        "Since blocking rules are SQL expressions, they can be arbitrarily complex. For example, you could create record comparisons where the initial of the first name and the surname match with the following rule:\n",
        "\n",
        "```python\n",
        "from splink import block_on\n",
        "block_on(\"substr(first_name, 1, 2)\", \"surname\")\n",
        "```\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Devising effective blocking rules for prediction\n",
        "\n",
        "The aims of your blocking rules are twofold:\n",
        "\n",
        "1. Eliminate enough non-matching comparison pairs so your record linkage job is small enough to compute\n",
        "2. Eliminate as few truly matching pairs as possible (ideally none)\n",
        "\n",
        "It is usually impossible to find a single blocking rule which achieves both aims, so we recommend using multiple blocking rules.\n",
        "\n",
        "When we specify multiple blocking rules, Splink will generate all comparison pairs that meet any one of the rules.\n",
        "\n",
        "For example, consider the following blocking rule:\n",
        "\n",
        "`block_on(\"first_name\", \"dob\")`\n",
        "\n",
        "This rule is likely to be effective in reducing the number of comparison pairs. It will retain all truly matching pairs, except those with errors or nulls in either the `first_name` or `dob` fields.\n",
        "\n",
        "Now consider a second blocking rule:\n",
        "\n",
        "`block_on(\"email\")`.\n",
        "\n",
        "This will retain all truly matching pairs, except those with errors or nulls in the `email` column.\n",
        "\n",
        "Individually, these blocking rules are problematic because they exclude true matches where the records contain typos of certain types. But between them, they might do quite a good job.\n",
        "\n",
        "For a true match to be eliminated by the use of these two blocking rules, it would have to have an error in _both_ `email` AND (`first_name` or `dob`).\n",
        "\n",
        "This is not completely implausible, but it is significantly less likely than if we'd used a single rule.\n",
        "\n",
        "More generally, we can often specify multiple blocking rules such that it becomes highly implausible that a true match would not meet at least one of these blocking criteria. This is the recommended approach in Splink. Generally we would recommend between about 3 and 10, though even more is possible.\n",
        "\n",
        "The question then becomes how to choose what to put in this list.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Splink tools to help choose your blocking rules\n",
        "\n",
        "Splink contains a number of tools to help you choose effective blocking rules. Let's try them out, using our small test dataset:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:01.481605Z",
          "iopub.status.busy": "2024-07-17T08:01:01.481304Z",
          "iopub.status.idle": "2024-07-17T08:01:01.500325Z",
          "shell.execute_reply": "2024-07-17T08:01:01.499540Z"
        },
        "tags": [
          "hide_input"
        ]
      },
      "outputs": [],
      "source": [
        "# Uncomment and run this cell if you're running in Google Colab.\n",
        "# !pip install splink"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:01.507471Z",
          "iopub.status.busy": "2024-07-17T08:01:01.507076Z",
          "iopub.status.idle": "2024-07-17T08:01:04.027392Z",
          "shell.execute_reply": "2024-07-17T08:01:04.026677Z"
        },
        "tags": []
      },
      "outputs": [],
      "source": [
        "from splink import DuckDBAPI, block_on, splink_datasets\n",
        "\n",
        "df = splink_datasets.fake_1000"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Counting the number of comparisons created by a single blocking rule\n",
        "\n",
        "On large datasets, some blocking rules imply the creation of trillions of record comparisons, which would cause a linkage job to fail.\n",
        "\n",
        "Before using a blocking rule in a linkage job, it's therefore a good idea to count the number of records it generates to ensure it is not too loose:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:04.031221Z",
          "iopub.status.busy": "2024-07-17T08:01:04.030942Z",
          "iopub.status.idle": "2024-07-17T08:01:04.225615Z",
          "shell.execute_reply": "2024-07-17T08:01:04.224859Z"
        },
        "tags": []
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'number_of_comparisons_generated_pre_filter_conditions': 1632,\n",
              " 'number_of_comparisons_to_be_scored_post_filter_conditions': 473,\n",
              " 'filter_conditions_identified': '',\n",
              " 'equi_join_conditions_identified': 'SUBSTR(l.first_name, 1, 1) = SUBSTR(r.first_name, 1, 1) AND l.\"surname\" = r.\"surname\"',\n",
              " 'link_type_join_condition': 'where l.\"unique_id\" < r.\"unique_id\"'}"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from splink.blocking_analysis import count_comparisons_from_blocking_rule\n",
        "\n",
        "db_api = DuckDBAPI()\n",
        "\n",
        "br = block_on(\"substr(first_name, 1,1)\", \"surname\")\n",
        "\n",
        "counts = count_comparisons_from_blocking_rule(\n",
        "    table_or_tables=df,\n",
        "    blocking_rule=br,\n",
        "    link_type=\"dedupe_only\",\n",
        "    db_api=db_api,\n",
        ")\n",
        "\n",
        "counts"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:04.261760Z",
          "iopub.status.busy": "2024-07-17T08:01:04.261491Z",
          "iopub.status.idle": "2024-07-17T08:01:04.314843Z",
          "shell.execute_reply": "2024-07-17T08:01:04.314299Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'number_of_comparisons_generated_pre_filter_conditions': 4827,\n",
              " 'number_of_comparisons_to_be_scored_post_filter_conditions': 372,\n",
              " 'filter_conditions_identified': 'LEVENSHTEIN(l.surname, r.surname) < 2',\n",
              " 'equi_join_conditions_identified': 'l.first_name = r.first_name',\n",
              " 'link_type_join_condition': 'where l.\"unique_id\" < r.\"unique_id\"'}"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "br = \"l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2\"\n",
        "\n",
        "counts = count_comparisons_from_blocking_rule(\n",
        "    table_or_tables=df,\n",
        "    blocking_rule= br,\n",
        "    link_type=\"dedupe_only\",\n",
        "    db_api=db_api,\n",
        ")\n",
        "counts"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The maximum number of comparisons that you can compute will be affected by your choice of SQL backend, and how powerful your computer is.\n",
        "\n",
        "For linkages in DuckDB on a standard laptop, we suggest using blocking rules that create no more than about 20 million comparisons. For Spark and Athena, try starting with fewer than 100 million comparisons, before scaling up.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Finding 'worst offending' values for your blocking rule\n",
        "\n",
        "Blocking rules can be affected by skew:  some values of a field may be much more common than others, and this can lead to a disproportionate number of comparisons being generated.\n",
        "\n",
        "It can be useful to identify whether your data is afflicted by this problem. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:04.318472Z",
          "iopub.status.busy": "2024-07-17T08:01:04.318176Z",
          "iopub.status.idle": "2024-07-17T08:01:04.364407Z",
          "shell.execute_reply": "2024-07-17T08:01:04.363784Z"
        }
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>key_0</th>\n",
              "      <th>key_1</th>\n",
              "      <th>count_l</th>\n",
              "      <th>count_r</th>\n",
              "      <th>block_count</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Birmingham</td>\n",
              "      <td>Theodore</td>\n",
              "      <td>7</td>\n",
              "      <td>7</td>\n",
              "      <td>49</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>London</td>\n",
              "      <td>Oliver</td>\n",
              "      <td>7</td>\n",
              "      <td>7</td>\n",
              "      <td>49</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>London</td>\n",
              "      <td>James</td>\n",
              "      <td>6</td>\n",
              "      <td>6</td>\n",
              "      <td>36</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "        key_0     key_1  count_l  count_r  block_count\n",
              "0  Birmingham  Theodore        7        7           49\n",
              "1      London    Oliver        7        7           49\n",
              "2      London     James        6        6           36"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from splink.blocking_analysis import n_largest_blocks\n",
        "\n",
        "result = n_largest_blocks(    table_or_tables=df,\n",
        "    blocking_rule= block_on(\"city\", \"first_name\"),\n",
        "    link_type=\"dedupe_only\",\n",
        "    db_api=db_api,\n",
        "    n_largest=3\n",
        "    )\n",
        "\n",
        "result.as_pandas_dataframe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In this case, we can see that `Oliver`s in `London` will result in 49 comparisons being generated.  This is acceptable on this small dataset, but on a larger dataset, `Oliver`s in `London` could be responsible for many million comparisons."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Counting the number of comparisons created by a list of blocking rules\n",
        "\n",
        "As noted above, it's usually a good idea to use multiple blocking rules. It's therefore useful to know how many record comparisons will be generated when these rules are applied.\n",
        "\n",
        "Since the same record comparison may be created by several blocking rules, and Splink automatically deduplicates these comparisons, we cannot simply total the number of comparisons generated by each rule individually.\n",
        "\n",
        "Splink provides a chart that shows the marginal (additional) comparisons generated by each blocking rule, after deduplication:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:04.368088Z",
          "iopub.status.busy": "2024-07-17T08:01:04.367810Z",
          "iopub.status.idle": "2024-07-17T08:01:04.642204Z",
          "shell.execute_reply": "2024-07-17T08:01:04.640945Z"
        },
        "tags": []
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "<style>\n",
              "  #altair-viz-ac97b89040984f068c205e79a9bd127f.vega-embed {\n",
              "    width: 100%;\n",
              "    display: flex;\n",
              "  }\n",
              "\n",
              "  #altair-viz-ac97b89040984f068c205e79a9bd127f.vega-embed details,\n",
              "  #altair-viz-ac97b89040984f068c205e79a9bd127f.vega-embed details summary {\n",
              "    position: relative;\n",
              "  }\n",
              "</style>\n",
              "<div id=\"altair-viz-ac97b89040984f068c205e79a9bd127f\"></div>\n",
              "<script type=\"text/javascript\">\n",
              "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
              "  (function(spec, embedOpt){\n",
              "    let outputDiv = document.currentScript.previousElementSibling;\n",
              "    if (outputDiv.id !== \"altair-viz-ac97b89040984f068c205e79a9bd127f\") {\n",
              "      outputDiv = document.getElementById(\"altair-viz-ac97b89040984f068c205e79a9bd127f\");\n",
              "    }\n",
              "    const paths = {\n",
              "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
              "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
              "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
              "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
              "    };\n",
              "\n",
              "    function maybeLoadScript(lib, version) {\n",
              "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
              "      return (VEGA_DEBUG[key] == version) ?\n",
              "        Promise.resolve(paths[lib]) :\n",
              "        new Promise(function(resolve, reject) {\n",
              "          var s = document.createElement('script');\n",
              "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
              "          s.async = true;\n",
              "          s.onload = () => {\n",
              "            VEGA_DEBUG[key] = version;\n",
              "            return resolve(paths[lib]);\n",
              "          };\n",
              "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
              "          s.src = paths[lib];\n",
              "        });\n",
              "    }\n",
              "\n",
              "    function showError(err) {\n",
              "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
              "      throw err;\n",
              "    }\n",
              "\n",
              "    function displayChart(vegaEmbed) {\n",
              "      vegaEmbed(outputDiv, spec, embedOpt)\n",
              "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
              "    }\n",
              "\n",
              "    if(typeof define === \"function\" && define.amd) {\n",
              "      requirejs.config({paths});\n",
              "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
              "    } else {\n",
              "      maybeLoadScript(\"vega\", \"5\")\n",
              "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
              "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
              "        .catch(showError)\n",
              "        .then(() => displayChart(vegaEmbed));\n",
              "    }\n",
              "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300}}, \"data\": {\"name\": \"data-98f9ef003c07a61c2c401d3a599f708c\"}, \"mark\": \"bar\", \"encoding\": {\"order\": {\"field\": \"cumulative_rows\"}, \"tooltip\": [{\"field\": \"blocking_rule\", \"title\": \"SQL Condition\", \"type\": \"nominal\"}, {\"field\": \"row_count\", \"format\": \",\", \"title\": \"Comparisons Generated\", \"type\": \"quantitative\"}, {\"field\": \"cumulative_rows\", \"format\": \",\", \"title\": \"Cumulative Comparisons\", \"type\": \"quantitative\"}, {\"field\": \"cartesian\", \"format\": \",\", \"title\": \"Total comparisons in Cartesian product\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"start\", \"title\": \"Comparisons Generated by Rule(s)\", \"type\": \"quantitative\"}, \"x2\": {\"field\": \"cumulative_rows\"}, \"y\": {\"field\": \"blocking_rule\", \"sort\": [\"-x2\"], \"title\": \"SQL Blocking Rule\"}}, \"height\": {\"step\": 20}, \"title\": {\"text\": \"Count of Additional Comparisons Generated by Each Blocking Rule\", \"subtitle\": \"(Counts exclude comparisons already generated by previous rules)\"}, \"width\": 450, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\", \"datasets\": {\"data-98f9ef003c07a61c2c401d3a599f708c\": [{\"blocking_rule\": \"(SUBSTR(l.first_name, 1, 1) = SUBSTR(r.first_name, 1, 1)) AND (l.\\\"surname\\\" = r.\\\"surname\\\")\", \"row_count\": 473, \"cumulative_rows\": 473, \"cartesian\": 499500, \"match_key\": \"0\", \"start\": 0}, {\"blocking_rule\": \"l.\\\"surname\\\" = r.\\\"surname\\\"\", \"row_count\": 1165, \"cumulative_rows\": 1638, \"cartesian\": 499500, \"match_key\": \"1\", \"start\": 473}, {\"blocking_rule\": \"l.\\\"email\\\" = r.\\\"email\\\"\", \"row_count\": 468, \"cumulative_rows\": 2106, \"cartesian\": 499500, \"match_key\": \"2\", \"start\": 1638}, {\"blocking_rule\": \"(l.\\\"city\\\" = r.\\\"city\\\") AND (l.\\\"first_name\\\" = r.\\\"first_name\\\")\", \"row_count\": 145, \"cumulative_rows\": 2251, \"cartesian\": 499500, \"match_key\": \"3\", \"start\": 2106}, {\"blocking_rule\": \"l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2\", \"row_count\": 30, \"cumulative_rows\": 2281, \"cartesian\": 499500, \"match_key\": \"4\", \"start\": 2251}]}}, {\"mode\": \"vega-lite\"});\n",
              "</script>"
            ],
            "text/plain": [
              "alt.Chart(...)"
            ]
          },
          "execution_count": 6,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from splink.blocking_analysis import (\n",
        "    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,\n",
        ")\n",
        "\n",
        "blocking_rules_for_analysis = [\n",
        "    block_on(\"substr(first_name, 1,1)\", \"surname\"),\n",
        "    block_on(\"surname\"),\n",
        "    block_on(\"email\"),\n",
        "    block_on(\"city\", \"first_name\"),\n",
        "    \"l.first_name = r.first_name and levenshtein(l.surname, r.surname) < 2\",\n",
        "]\n",
        "\n",
        "\n",
        "cumulative_comparisons_to_be_scored_from_blocking_rules_chart(\n",
        "    table_or_tables=df,\n",
        "    blocking_rules=blocking_rules_for_analysis,\n",
        "    db_api=db_api,\n",
        "    link_type=\"dedupe_only\",\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Digging deeper: Understanding why certain blocking rules create large numbers of comparisons\n",
        "\n",
        "Finally, we can use the `profile_columns` function we saw in the previous tutorial to understand a specific blocking rule in more depth.\n",
        "\n",
        "Suppose we're interested in blocking on city and first initial.\n",
        "\n",
        "Within each distinct value of `(city, first initial)`, all possible pairwise comparisons will be generated.\n",
        "\n",
        "So for instance, if there are 15 distinct records with `London,J` then these records will result in `n(n-1)/2 = 105` pairwise comparisons being generated.\n",
        "\n",
        "In a larger dataset, we might observe 10,000 `London,J` records, which would then be responsible for `49,995,000` comparisons.\n",
        "\n",
        "These high-frequency values therefore have a disproportionate influence on the overall number of pairwise comparisons, and so it can be useful to analyse skew, as follows:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "execution": {
          "iopub.execute_input": "2024-07-17T08:01:04.645665Z",
          "iopub.status.busy": "2024-07-17T08:01:04.645388Z",
          "iopub.status.idle": "2024-07-17T08:01:04.857248Z",
          "shell.execute_reply": "2024-07-17T08:01:04.856730Z"
        },
        "tags": []
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "<style>\n",
              "  #altair-viz-6df195f20cdf4a60b9bdc4a211af0119.vega-embed {\n",
              "    width: 100%;\n",
              "    display: flex;\n",
              "  }\n",
              "\n",
              "  #altair-viz-6df195f20cdf4a60b9bdc4a211af0119.vega-embed details,\n",
              "  #altair-viz-6df195f20cdf4a60b9bdc4a211af0119.vega-embed details summary {\n",
              "    position: relative;\n",
              "  }\n",
              "</style>\n",
              "<div id=\"altair-viz-6df195f20cdf4a60b9bdc4a211af0119\"></div>\n",
              "<script type=\"text/javascript\">\n",
              "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
              "  (function(spec, embedOpt){\n",
              "    let outputDiv = document.currentScript.previousElementSibling;\n",
              "    if (outputDiv.id !== \"altair-viz-6df195f20cdf4a60b9bdc4a211af0119\") {\n",
              "      outputDiv = document.getElementById(\"altair-viz-6df195f20cdf4a60b9bdc4a211af0119\");\n",
              "    }\n",
              "    const paths = {\n",
              "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
              "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
              "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
              "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
              "    };\n",
              "\n",
              "    function maybeLoadScript(lib, version) {\n",
              "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
              "      return (VEGA_DEBUG[key] == version) ?\n",
              "        Promise.resolve(paths[lib]) :\n",
              "        new Promise(function(resolve, reject) {\n",
              "          var s = document.createElement('script');\n",
              "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
              "          s.async = true;\n",
              "          s.onload = () => {\n",
              "            VEGA_DEBUG[key] = version;\n",
              "            return resolve(paths[lib]);\n",
              "          };\n",
              "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
              "          s.src = paths[lib];\n",
              "        });\n",
              "    }\n",
              "\n",
              "    function showError(err) {\n",
              "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
              "      throw err;\n",
              "    }\n",
              "\n",
              "    function displayChart(vegaEmbed) {\n",
              "      vegaEmbed(outputDiv, spec, embedOpt)\n",
              "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
              "    }\n",
              "\n",
              "    if(typeof define === \"function\" && define.amd) {\n",
              "      requirejs.config({paths});\n",
              "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
              "    } else {\n",
              "      maybeLoadScript(\"vega\", \"5\")\n",
              "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
              "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
              "        .catch(showError)\n",
              "        .then(() => displayChart(vegaEmbed));\n",
              "    }\n",
              "  })({\"config\": {\"view\": {\"continuousWidth\": 400, \"continuousHeight\": 300}}, \"vconcat\": [{\"hconcat\": [{\"mark\": {\"type\": \"line\", \"interpolate\": \"step-after\"}, \"data\": {\"values\": [{\"percentile_ex_nulls\": 0.9777448177337646, \"percentile_inc_nulls\": 0.9850000143051147, \"value_count\": 15, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 15.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.924332320690155, \"percentile_inc_nulls\": 0.9490000009536743, \"value_count\": 12, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 36.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.9094955325126648, \"percentile_inc_nulls\": 0.9390000104904175, \"value_count\": 10, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 10.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.8694362044334412, \"percentile_inc_nulls\": 0.9120000004768372, \"value_count\": 9, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 27.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.8456973433494568, \"percentile_inc_nulls\": 0.8960000276565552, \"value_count\": 8, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 16.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.7937685251235962, \"percentile_inc_nulls\": 0.8610000014305115, \"value_count\": 7, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 35.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.7670623064041138, \"percentile_inc_nulls\": 0.8429999947547913, \"value_count\": 6, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 18.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.7225519418716431, \"percentile_inc_nulls\": 0.812999963760376, \"value_count\": 5, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 30.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.5979228615760803, \"percentile_inc_nulls\": 0.7289999723434448, \"value_count\": 4, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 84.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.46439170837402344, \"percentile_inc_nulls\": 0.6389999985694885, \"value_count\": 3, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 90.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.357566773891449, \"percentile_inc_nulls\": 0.5670000314712524, \"value_count\": 2, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 72.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 0.0, \"percentile_inc_nulls\": 0.3259999752044678, \"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 241.0, \"distinct_value_count\": 352}, {\"percentile_ex_nulls\": 1.0, \"percentile_inc_nulls\": 1.0, \"value_count\": 15, \"group_name\": \"city_left_first_name_1_\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"sum_tokens_in_value_count_group\": 15.0, \"distinct_value_count\": 352}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"percentile_ex_nulls\", \"type\": \"quantitative\"}, {\"field\": \"percentile_inc_nulls\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"percentile_ex_nulls\", \"sort\": \"descending\", \"title\": \"Percentile\", \"type\": \"quantitative\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Count of values\", \"type\": \"quantitative\"}}, \"title\": {\"text\": \"Distribution of counts of values in column city || left(first_name,1)\", \"subtitle\": \"In this col, 326 values (32.6%) are null and there are 352 distinct values\"}}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 15, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonJ\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 12, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonH\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 12, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonF\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 12, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonE\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 10, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonO\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 9, \"group_name\": \"city_left_first_name_1_\", \"value\": \"BirminghamT\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 9, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonT\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 9, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondonL\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 8, \"group_name\": \"city_left_first_name_1_\", \"value\": \"CoventryL\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 8, \"group_name\": \"city_left_first_name_1_\", \"value\": \"SalfordL\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Top 10 values by value count\"}, {\"mark\": \"bar\", \"data\": {\"values\": [{\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"BradfofrdL\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LononR\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LvpreoolR\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LoodonT\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"BirminghaimT\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"BirgmhniamT\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"BrritsolR\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"PlymouthG\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"BirminmhagA\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}, {\"value_count\": 1, \"group_name\": \"city_left_first_name_1_\", \"value\": \"LondoonL\", \"total_non_null_rows\": 674, \"total_rows_inc_nulls\": 1000, \"distinct_value_count\": 352}]}, \"encoding\": {\"tooltip\": [{\"field\": \"value\", \"type\": \"nominal\"}, {\"field\": \"value_count\", \"type\": \"quantitative\"}, {\"field\": \"total_non_null_rows\", \"type\": \"quantitative\"}, {\"field\": \"total_rows_inc_nulls\", \"type\": \"quantitative\"}], \"x\": {\"field\": \"value\", \"sort\": \"-y\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"field\": \"value_count\", \"scale\": {\"domain\": [0, 15]}, \"title\": \"Value count\", \"type\": \"quantitative\"}}, \"title\": \"Bottom 10 values by value count\"}]}], \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\"}, {\"mode\": \"vega-lite\"});\n",
              "</script>"
            ],
            "text/plain": [
              "alt.VConcatChart(...)"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from splink.exploratory import profile_columns\n",
        "\n",
        "profile_columns(df, column_expressions=[\"city || left(first_name,1)\"], db_api=db_api)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "!!! note \"Further Reading\"\n",
        "    :simple-readme: For a deeper dive on blocking, please refer to the [Blocking Topic Guides](../../topic_guides/blocking/blocking_rules.md).\n",
        "\n",
        "    :material-tools: For more on the blocking tools in Splink, please refer to the [Blocking API documentation](../../api_docs/blocking.md).\n",
        "\n",
        "    :bar_chart: For more on the charts used in this tutorial, please refer to the [Charts Gallery](../../charts/index.md#blocking).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Next steps\n",
        "\n",
        "Now we have chosen which records to compare, we can use those records to train a linkage model.\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.8"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}